Twitter messages often contain so-called hashtags to denote keywords related 
to them. Using a dataset of 29 million messages, I explore relations among these 
hashtags with respect to co-occurrences. Furthermore, I present an attempt to 
classify hashtags into five intuitive classes, using a machine-learning approach. 
The overall outcome is an interactive Web application to explore Twitter hashtags.

1 Introduction 



Twitter¹ is a fast-growing social Web application allowing its users to publish and 
communicate with very short messages, so-called tweets, limited to 140 characters 
each. In the first half of 2010, there were over 100 million users registered at 
Twitter [16], composing more than 65 million tweets per day [2]. 

Naturally, the language used in tweets is characterized by many abbreviations 
(e.g. 4 U) and emoticons (e.g. :)), as in SMS. However, there are also 
Twitter-specific forms of annotation, most notably so-called @-replies and 
hashtags, as in the following tweet: 

@merazindagi Thanks! Will make more 4 U. Live performances in 
#boulder area will be on http://saxy.us :) #jazz #rock #funk 
#dance #livemusic 

Hashtags are simply words that are preceded by a hash (#). They can be used 
both inside the text and at its end to annotate keywords for a tweet. Twitter 
displays each hashtag as a link to a page listing other tweets containing the hashtag; 
that is where the "tag" in "hashtags" comes from, as they serve a similar purpose 
as tags on websites like Flickr² and Delicious³. 

The problem with many hashtags is that it is often impossible to tell from their 
name alone what they are about (e.g. #tcot, #p2, #sgp). This problem 
might (at least partially) be solved by the two approaches described in this work: a 
dictionary built upon co-occurrences (section 3) and a machine-learnt classification 
into basic classes (section 4), plugged into an interactive Web application (section 5). 



2 Dataset and pre-processing 

A nice feature of Twitter, at least from a researcher's point of view, is its open 
API for accessing tweets. However, crawling many tweets can take a long time due to 
rate limits. I was lucky to get a dataset of 29 million tweets from Munmun De 
Choudhury [2], crawled from November 2008 to November 2009, with the majority 
of tweets crawled in the last month, as can be seen in Figure 1. Although this 
bias towards later tweets is unlikely to affect the studies of hashtag 
co-occurrences and classification, it could be subject to further investigation. 

To reduce noise, I focused on those 85,503 hashtags (out of about 310,000) that 
occur in at least three tweets in this dataset. They correspond to 2,800,027 tweets 
where at least one of these "relevant" hashtags occurs. 

All programming was done in the Python programming language⁴, which is 
particularly attractive for language processing because of the Natural Language 
Toolkit (nltk) [1]. To handle the large dataset, a Whoosh⁵ index of all the relevant 
tweets was built, especially to have fast access to the tweets containing a certain hashtag. 

¹ http://twitter.com/
² http://www.flickr.com/photos/tags/
³ http://www.delicious.com/?view=tags
⁴ http://www.python.org/
⁵ http://bitbucket.org/mchaput/whoosh/

Figure 1: Timeline of tweets per day in the provided dataset. 



c/o|b/c|w/o|w/|\+/-|                                       (?# common abbreviations, +/- )
\d+(?:[.,:/-]\d+)+|                                        (?# numbers, fractions, dates, etc. )
(?:[:;][-=]?|=)[Dp(|)][DO]*|<3+|                           (?# smileys )
(?:https?\:\/\/|www\.)[a-zA-Z0-9/.?=&\-#]*[a-zA-Z0-9/]|    (?# URLs )
[#@]\w+|                                                   (?# hashtags, @-replies )
\w+(?:-\w+)*(?:'\w+)?|                                     (?# ordinary words )
[$£€¥$§<§&7„\+~]+                                          (?# special symbols )



Listing 1: The regular expression used to find words in a tweet text. Spaces, linebreaks, and 
comments were added for readability; the standard Python regular expression syntax is used. 

As the language used in tweets differs considerably from ordinary language, 
a special way of tokenizing texts was needed. In particular, hashtags, 
@-replies, and URLs should be preserved in the tokenization process. See Listing 
1 for the custom regular expression that was used to find all the tokens in a given 
text. This differs from the perhaps more common approach of splitting the text at 
certain delimiters, which I found impractical because, for example, a slash (/) might 
denote a delimiter (as in this/that) as well as an abbreviation marker (b/c), or it might 
be used inside links. 
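The matching-based tokenization can be sketched as follows. This is a simplified pattern modelled on Listing 1, not the exact expression used in this work: the smiley and symbol alternatives are reduced, and the name `tokenize` is chosen here for illustration. Alternatives are tried from left to right, so abbreviations and URLs win over ordinary words.

```python
import re

# Simplified token pattern modelled on Listing 1 (an assumption, not the
# original expression). re.VERBOSE lets us add whitespace and comments.
TOKEN_RE = re.compile(r"""
    c/o|b/c|w/o|w/|\+/-                                     # common abbreviations
  | \d+(?:[.,:/-]\d+)+                                      # numbers, fractions, dates
  | (?:[:;][-=]?|=)[Dp(|)]                                  # simple smileys
  | (?:https?://|www\.)[a-zA-Z0-9/.?=&\-\#]*[a-zA-Z0-9/]    # URLs
  | [\#@]\w+                                                # hashtags, @-replies
  | \w+(?:-\w+)*(?:'\w+)?                                   # ordinary words
""", re.VERBOSE)


def tokenize(text):
    """Return all tokens found in text; unmatched characters are skipped."""
    return TOKEN_RE.findall(text)


tokens = tokenize("@merazindagi Thanks! Live performances in #boulder "
                  "will be on http://saxy.us :) #jazz")
slash_tokens = tokenize("this/that b/c")
```

In this sketch the slash in this/that is dropped as a delimiter, while b/c survives as a single token, which is the distinction that splitting at delimiters cannot make.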

3 Hashtag co-occurrences 
3.1 Dictionary 

Inspired by the Web 2.0 dictionary [TB], I built a "dictionary" of hashtags defined 
by co-occurring hashtags. The co-occurrence count of two hashtags $h_i$, $h_j$ can be 
formally defined as 

$$C(h_i, h_j) := \left|\{\text{tweet } t \mid h_i \in t \land h_j \in t\}\right|.$$

Given a hashtag $h$, let its dictionary entry $D(h)$ consist of the ten hashtags $h_j \neq h$ 
with the highest co-occurrence counts $C(h, h_j)$, in descending order. 
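The two definitions above translate almost directly into code. The following is a minimal in-memory sketch, assuming each tweet has already been reduced to its list of hashtags; the function names and the toy tweets are illustrative and not taken from the original implementation.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tweets):
    """Count, for every unordered pair of hashtags, the tweets containing both."""
    counts = Counter()
    for hashtags in tweets:
        # A set ensures that a hashtag repeated within one tweet counts once;
        # sorting gives every pair a canonical order.
        for h_i, h_j in combinations(sorted(set(hashtags)), 2):
            counts[h_i, h_j] += 1
    return counts


def dictionary_entry(counts, h, size=10):
    """Return D(h): up to `size` hashtags with the highest C(h, h_j)."""
    related = Counter()
    for (h_i, h_j), c in counts.items():
        if h_i == h:
            related[h_j] = c
        elif h_j == h:
            related[h_i] = c
    return [tag for tag, _ in related.most_common(size)]


# Toy data (hypothetical tweets, hashtags only).
tweets = [
    ["#jazz", "#rock", "#livemusic"],
    ["#jazz", "#rock"],
    ["#jazz", "#funk"],
]
counts = cooccurrence_counts(tweets)
entry = dictionary_entry(counts, "#jazz")
```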

In total, there are 1,462,215 pairs of hashtags which co-occur. Storing their 
co-occurrence counts $C(h_i, h_j)$ in a Python dictionary consumes about 190 MB of 
memory, which would be feasible for single calculations, but probably not in a Web 
application, as introduced in section 5, running on a (multi-site) webserver with 
very limited resources. Apart from that, memory-based storage does not scale 
well, and since this application is meant as a prototype for a more extensive version 
employing more data, another solution had to be found. That is why I created an 
SQLite⁶ database file for storing the co-occurrence counts of all relevant hashtags. 

h = …                 h = …                 h = …
…            …        #mac          …       #mac        …
#tlot        442      #ipod         …       #linux      60
#politics    379      #itunes       158     #vista      57
#gop         363      #google       113     #windows7   51
#healthcare  275      #imac         105     #win7       47
#p2          249      #tech         95      #software   38
#sgp         239      #microsoft    95      #xp         37
#nobel       217      #fail         86      #ubuntu     34
#tea         183      #snowleopard  85      #iphone     30

Table 1: Co-occurrence lists for three hashtags h. 
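A disk-based store of this kind could look roughly as follows, using Python's built-in `sqlite3` module. The table name, column names, and the decision to store each pair in both directions are assumptions made for this sketch; the schema actually used is not described in this work.

```python
import sqlite3


def build_database(path, counts):
    """Store co-occurrence counts in an SQLite file (hypothetical schema)."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS cooccurrence ("
        " h1 TEXT NOT NULL, h2 TEXT NOT NULL, n INTEGER NOT NULL,"
        " PRIMARY KEY (h1, h2))"
    )
    # Store every pair in both directions, so that a single lookup on the
    # primary-key prefix h1 finds all partners of a hashtag.
    rows = []
    for (h1, h2), n in counts.items():
        rows.append((h1, h2, n))
        rows.append((h2, h1, n))
    con.executemany("INSERT OR REPLACE INTO cooccurrence VALUES (?, ?, ?)", rows)
    con.commit()
    return con


def dictionary_entry(con, h, size=10):
    """Return D(h) by letting the database sort and truncate the partners."""
    cur = con.execute(
        "SELECT h2 FROM cooccurrence WHERE h1 = ? ORDER BY n DESC LIMIT ?",
        (h, size),
    )
    return [row[0] for row in cur]


# ":memory:" keeps the example self-contained; a file path works the same way.
con = build_database(":memory:", {("#jazz", "#rock"): 5, ("#funk", "#jazz"): 3})
jazz_entry = dictionary_entry(con, "#jazz")
rock_entry = dictionary_entry(con, "#rock")
```

Only the rows for the requested hashtag are read, so memory use stays small regardless of the total number of pairs.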

3.2 Evaluation 

Three examples of the resulting dictionary entries can be seen in Table 1. The 
results seem reasonable, but the question is how to evaluate this dictionary 
formally. The least one would expect from such a dictionary is 
that the words in $D(h)$ are somehow related to $h$. My idea was that the 
strength of the relation between two words can be measured by examining the 
path between them in the WordNet [32] lexical database of hypernym/hyponym 
relations. 

The main problem with this approach is that many hashtag names 
will not appear in the WordNet corpus, either because they are non-English words 
or because they are not real words at all. Still, if the dictionary provides strong 
relations for hashtags restricted to WordNet entries (resulting in a set $H_{\text{WN}}$ of 
13,791 hashtags), this is a good indication that it works in general. 

Another issue is that WordNet does not deal with individual words (lemmas), 
but with synsets, which are simply groups of synonymous lemmas. The basic 
assumption of my evaluation is that two words are related as much as the most 
related pair of their respective synsets is, or formally 

$$S(h_1, h_2) := \max_{\substack{\text{synset } s_1 \text{ s.t. } h_1 \in s_1 \\ \text{synset } s_2 \text{ s.t. } h_2 \in s_2}} S(s_1, s_2),$$

where $S(\cdot, \cdot)$ denotes the similarity between two hashtags or two synsets, 
respectively. 

For computing the actual similarity, two similarity measures were used: 

⁶ http://www.sqlite.org/

1. the path distance similarity, which is defined as 

$$S_{\text{path}}(s_1, s_2) := \frac{1}{d(s_1, s_2) + 1},$$

where $d(s_1, s_2)$ denotes the length of the shortest path between the two synsets 
$s_1$, $s_2$ in the taxonomy, and 

2. the Wu-Palmer distance [Ej, which is defined as 

$$S_{\text{WP}}(s_1, s_2) := \frac{2\, d(s(s_1, s_2))}{d(s_1) + d(s_2)},$$

where $s(s_1, s_2)$ denotes the lowest common subsumer of $s_1$, $s_2$, and $d(s)$ the 
depth of synset $s$ in the taxonomy. 
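To make the two measures and the maximum over synset pairs concrete, here is a small self-contained sketch on a hand-made toy taxonomy. The taxonomy, the convention that the root has depth 1, and all function names are assumptions for illustration; the evaluation itself relied on the WordNet module of nltk, where the corresponding synset methods are `path_similarity` and `wup_similarity`.

```python
# Hypothetical toy taxonomy: each node maps to its hypernym (None for the root).
PARENT = {
    "entity": None,
    "device": "entity",
    "computer": "device",
    "phone": "device",
    "laptop": "computer",
    "animal": "entity",
}


def ancestors(s):
    """Return the chain [s, parent(s), ..., root]."""
    chain = []
    while s is not None:
        chain.append(s)
        s = PARENT[s]
    return chain


def depth(s):
    """Depth in the taxonomy; the root has depth 1 (an assumption)."""
    return len(ancestors(s))


def lowest_common_subsumer(s1, s2):
    up1 = set(ancestors(s1))
    return next(a for a in ancestors(s2) if a in up1)


def path_similarity(s1, s2):
    """S_path = 1 / (d(s1, s2) + 1), with d the path length via the subsumer."""
    lcs = lowest_common_subsumer(s1, s2)
    d = (depth(s1) - depth(lcs)) + (depth(s2) - depth(lcs))
    return 1.0 / (d + 1)


def wu_palmer(s1, s2):
    """S_WP = 2 * depth(lcs) / (depth(s1) + depth(s2))."""
    lcs = lowest_common_subsumer(s1, s2)
    return 2.0 * depth(lcs) / (depth(s1) + depth(s2))


def word_similarity(synsets1, synsets2, measure):
    """S(h1, h2): the maximum of the measure over all pairs of synsets."""
    return max(measure(a, b) for a in synsets1 for b in synsets2)


# A word with two senses ("animal" and "computer") compared with "phone".
best = word_similarity(["animal", "computer"], ["phone"], path_similarity)
```

In a tree-shaped taxonomy like this one, the shortest path between two nodes always runs through their lowest common subsumer, which is why both measures can share that helper.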

Python functions for both measures are already implemented in the WordNet 
module of nltk [1]. The following calculations were performed: 

1. For every hashtag $h$ in $H_{\text{WN}}$ and the respective co-occurring hashtags 
$h_i \in D(h) \cap H_{\text{WN}}$, compute their similarities $S_{\text{path}}(h, h_i)$ and $S_{\text{WP}}(h, h_i)$. 

2. As a first baseline, for every hashtag $h$ in $H_{\text{WN}}$ and ten random hashtags 
$r_i \in H_{\text{WN}}$, compute their similarities $S_{\text{path}}(h, r_i)$ and $S_{\text{WP}}(h, r_i)$. 

3. As a second baseline, for 10,000 lemmas $l$ in WordNet and ten random other 
lemmas $k$, compute their similarities $S_{\text{path}}(l, k)$ and $S_{\text{WP}}(l, k)$. 

The first baseline acts as a baseline within Twitter, i.e. it discards co-occurrence 
information and, given a hashtag, considers similarities to arbitrary other 
hashtags. The second baseline is a general measure in WordNet, reflecting the average 
similarity among lemmas. 

Taking averages of all respective similarities, the following results were obtained: 

                      average S_path    average S_WP
Co-occurrences             0.12             0.37
Baseline (Twitter)         0.07             0.26
Baseline (WordNet)         0.05             0.16



This shows a considerably higher similarity between co-occurring hashtags than 
between arbitrary pairs of hashtags or random pairs of words. In 
particular, the WordNet path between co-occurring hashtags is only about half as 
long as between random hashtags or words. 
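As a rough plausibility check of the "half as long" statement, the path similarity can be inverted to obtain a typical path length. This is only a back-of-the-envelope sketch: it assumes that all pairs in a group sit at one typical distance, which the reported averages do not guarantee, since the mean of $1/(d+1)$ is in general not the reciprocal of the mean distance plus one.

```latex
S_{\text{path}} = \frac{1}{d + 1}
\quad\Longrightarrow\quad
d = \frac{1}{S_{\text{path}}} - 1,
\qquad
d_{\text{co-occ}} \approx \frac{1}{0.12} - 1 \approx 7.3,\quad
d_{\text{Twitter}} \approx \frac{1}{0.07} - 1 \approx 13.3,\quad
d_{\text{WordNet}} \approx \frac{1}{0.05} - 1 = 19 .
```

Under this assumption, the typical path between co-occurring hashtags is about 0.55 times as long as between random hashtags and about 0.39 times as long as between random WordNet lemmas.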

3.3 Clustered graph 

To visualize hashtags and their relations, I created a graph of the 1000 most 
frequent hashtags. Edges were created among the 600 pairs of hashtags with 
the highest co-occurrence counts, with a weight corresponding to this count. To 
find structures in the graph, it was partitioned into 20 parts using kmetis [T2| ITS"]. 
This graph partitioning basically minimizes the weight of edge cuts. The resulting 
graph was then plotted using the spring layout [6] in the NetworkX library [8]. 
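The graph construction and the quantity that the partitioning minimizes can be sketched in plain Python. This sketch assumes that the 600 edges are the heaviest pairs among the selected nodes, and it does not perform the partitioning or the layout itself; those steps were done with kmetis and NetworkX. Function names and the toy data are illustrative.

```python
from collections import Counter


def build_graph(tag_frequency, pair_counts, n_nodes=1000, n_edges=600):
    """Return (nodes, edges) for the most frequent hashtags.

    edges maps a pair of hashtags to its co-occurrence count (edge weight).
    """
    nodes = {tag for tag, _ in Counter(tag_frequency).most_common(n_nodes)}
    candidates = {
        pair: c for pair, c in pair_counts.items()
        if pair[0] in nodes and pair[1] in nodes
    }
    edges = dict(Counter(candidates).most_common(n_edges))
    return nodes, edges


def cut_weight(edges, part_of):
    """Total weight of edges whose endpoints lie in different parts."""
    return sum(w for (a, b), w in edges.items() if part_of[a] != part_of[b])


# Toy data (hypothetical frequencies and co-occurrence counts).
frequency = {"#tcot": 50, "#gop": 40, "#jobs": 30, "#hiring": 20, "#rare": 1}
pairs = {
    ("#gop", "#tcot"): 9,
    ("#hiring", "#jobs"): 7,
    ("#gop", "#jobs"): 1,
    ("#rare", "#tcot"): 5,
}
nodes, edges = build_graph(frequency, pairs, n_nodes=4, n_edges=3)
partition = {"#tcot": 0, "#gop": 0, "#jobs": 1, "#hiring": 1}
cut = cut_weight(edges, partition)
```

A good partition keeps heavily co-occurring hashtags together, so only the weak link between the politics and jobs groups contributes to the cut in this toy example.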



The result can be seen in Figure 2(a). It exhibits some really interesting relations 
among hashtags, and the clustering seems to catch many actual topic fields. For 
instance, there is a relatively clear cluster of hashtags related to U.S. politics, 
another one for jobs, and one for the German election in 2009; see Figure 2(b). 

(a) principal connected component 
(b) German election cluster in the upper right corner, featuring political parties 



Figure 2: Clustered graph of hashtags. Clusters are visualized by different colors, both for 
nodes and edges. Edges connecting nodes in different clusters are colored gray. 

4 Classification of hashtags 

4.1 Classes 

Aside from retrieving related hashtags, it might be of particular interest to classify 
hashtags. This is related to named entity recognition: in most cases, hashtags 
represent named entities, apart from emotions, e.g. #fail, or general categories, 
e.g. #photography. Put this way, the recognition of (the most relevant) named 
entities in tweets is trivial, as they are usually represented by hashtags. 

What remains is the classification of these named entities and other hashtags. 
As a first approach to this goal, I took the common named entity classes 
organization, geolocation, and person, and added the classes event, as event recognition 
might be of particular interest on Twitter, and category, which basically contains 
all other hashtags that do not fit into any other class, like emotions, fields of 
interest, and even products. See Table 2 for some examples of hashtags for each 
class. 

4.2 Machine-learning 

The basic approach was to use a maximum entropy (MaxEnt) classifier [11, pages 
235-241] to soft-classify the use of a hashtag in each tweet it occurs in, and then classify 
the hashtag as the average of all these classifications. MaxEnt was chosen mainly 
because it performs very well in [S] and is readily implemented 
in nltk. Employing SciPy [10], the limited-memory 
Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm [J3] can be used to solve 
the underlying convex optimization problem. 

Table 2: Hashtag classes with examples (e.g. #bp, #sziget). 
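The averaging step can be illustrated independently of the classifier. The sketch below assumes that each tweet has already been soft-classified into a probability per class (with nltk's MaxEnt classifier this would come from `prob_classify`); the function name and the example numbers are hypothetical.

```python
CLASSES = ["organization", "geolocation", "person", "event", "category"]


def classify_hashtag(tweet_distributions):
    """Average per-tweet class probabilities and return (label, averages)."""
    n = len(tweet_distributions)
    average = {
        c: sum(dist.get(c, 0.0) for dist in tweet_distributions) / n
        for c in CLASSES
    }
    # The hashtag is assigned the class with the highest average probability.
    label = max(average, key=average.get)
    return label, average


# Hypothetical soft classifications of three tweets containing one hashtag.
distributions = [
    {"geolocation": 0.8, "category": 0.2},
    {"geolocation": 0.6, "event": 0.1, "category": 0.3},
    {"geolocation": 0.7, "category": 0.3},
]
label, average = classify_hashtag(distributions)
```

Averaging probabilities rather than taking a majority vote over hard labels keeps the information of uncertain tweets, which is one plausible reason for soft-classifying first.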

Given a hashtag h in the text of a tweet, the following (binary-encoded) 
classification features were used: 

1. the words in a window of size 5 around h, excluding the hashtag itself, 

2. the shape feature [11, pages 764-765] of each of these words (see Table 3), 

3. the part-of-speech tags of these words (see section 4.3), 

4. geographical background knowledge for these words (see section 4.4), 

5. the shape feature of h itself, without the leading hash (#), 

6. an indicator whether h is the first word (token) in the tweet, 

7. a position indicating in which fifth of the tweet (with respect to word indices) 
h is, 

8. an indicator whether all the words following h in the tweet are hashtags, 

9. the five most co-occurring hashtags of h, and 
10. geographical background knowledge for them. 
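A part of this feature list (features 1, 2, 5, 6, 7, and 8) can be sketched as a function producing a feature dictionary of the kind nltk classifiers accept. Several details are assumptions: the window is taken as five tokens on each side, the shape encoding is a common collapsed variant rather than the exact one of Table 3, and all feature names are invented for this sketch.

```python
import re


def shape(word):
    """Collapsed word shape: 'X' upper case, 'x' lower case, 'd' digit."""
    s = re.sub(r"[A-Z]", "X", word)
    s = re.sub(r"[a-z]", "x", s)
    s = re.sub(r"\d", "d", s)
    return re.sub(r"(.)\1+", r"\1", s)  # collapse runs, e.g. 'Xxxx' -> 'Xx'


def hashtag_features(tokens, index, window=5):
    """Binary features for the hashtag at tokens[index] (a partial sketch)."""
    features = {}
    lo, hi = max(0, index - window), min(len(tokens), index + window + 1)
    for pos in range(lo, hi):
        if pos == index:
            continue  # the hashtag itself is excluded from its context
        word = tokens[pos]
        features["word=" + word.lower()] = True
        features["word_shape=" + shape(word)] = True
    features["shape=" + shape(tokens[index].lstrip("#"))] = True
    features["is_first"] = index == 0
    # Position of the hashtag, as one of five equal parts of the tweet (0..4).
    features["fifth=%d" % min(4, 5 * index // len(tokens))] = True
    # Vacuously true if the hashtag is the last token.
    features["only_hashtags_follow"] = all(
        t.startswith("#") for t in tokens[index + 1:]
    )
    return features


tokens = ["Live", "performances", "in", "#boulder", "area", "#jazz", "#rock"]
boulder = hashtag_features(tokens, 3)
jazz = hashtag_features(tokens, 5)
```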



4.3 Part-of-speech tagging 

As part-of-speech tagger, I chose HunPos [9], which is an open-source 
reimplementation of the popular TnT tagger [3]. It has an easy-to-use interface included 
in nltk, and trained models for English text are available for download. These models 
might not be completely adequate for Twitter text, but the results are 
reasonable, so no further work has been invested in that direction so far. Here are two 
part-of-speech-tagged sentences from the introductory example: 

Will/MD make/VB more/JJR 4/CD u/NN. Live/JJ performances/NNS in/IN 
#boulder/NN area/NN will/MD be/VB on/IN http://www.saxy.us/JJ. 



4.4 Geographical background knowledge 

To ease the recognition of geolocations, features indicating whether a word is 
a city, region, or country name, respectively, are provided, complemented by a 
feature indicating whether the word is any of these. The set of geospatial names 
was acquired from Geonames [7] and includes names of 219,833 cities having a 
population of at least 1000 inhabitants, 29,615 administrative regions, and 497 
countries, each including alternate names in languages other than English. 

Table 3: Shape features used in the classifier. 
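The four geographical indicators amount to simple set lookups. In the sketch below, the three small sets are hypothetical stand-ins for the name lists derived from Geonames, and the feature names are invented for illustration.

```python
# Hypothetical stand-ins for the Geonames-derived name sets.
CITIES = {"boulder", "athens"}
REGIONS = {"colorado", "attica"}
COUNTRIES = {"greece", "usa"}


def geo_features(word):
    """Indicators for city, region, and country names, plus their disjunction."""
    name = word.lstrip("#").lower()
    features = {
        "is_city": name in CITIES,
        "is_region": name in REGIONS,
        "is_country": name in COUNTRIES,
    }
    features["is_geo"] = any(features.values())
    return features


greece = geo_features("#Greece")
jazz = geo_features("jazz")
```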



4.5 Evaluation 

For training and evaluating the classifier, a set of 41 organization, 40 
geolocation, 26 person, 16 event, and 57 category hashtags was classified by hand, yielding 
a total of 180 hashtags as "gold standard." To reduce computational effort, not 
all tweets containing a certain hashtag are used for training or classification, 
but only 100 random tweets. 

The classifier was evaluated using 5-fold cross-validation. I chose five subsamples 
purely because of execution time, as processing ten subsamples would have taken too 
long. The subsamples are selected randomly and consist of 36 human-labeled hashtags each. 
So there are five evaluation phases; in each of them, the MaxEnt classifier is trained 
using 4 × 36 = 144 hashtags and 100 random tweets each. Then the remaining 36 
hashtags are classified by computing the average of the classifications of, again, 
100 random tweets each. 
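The evaluation protocol can be sketched as follows. The helper names and the fixed seeds are assumptions of this sketch; it also assumes that the number of labeled hashtags is divisible by the number of folds, as is the case for 180 hashtags and five folds.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists for k equally sized random folds."""
    items = list(items)
    random.Random(seed).shuffle(items)
    fold_size = len(items) // k  # leftovers would be dropped if not divisible
    folds = [items[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def sample_tweets(tweets, n=100, seed=0):
    """Pick at most n random tweets for one hashtag."""
    tweets = list(tweets)
    return random.Random(seed).sample(tweets, min(n, len(tweets)))


# 180 placeholder items stand in for the hand-labeled hashtags.
splits = list(k_fold_splits(range(180)))
subset = sample_tweets(range(250))
```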

The resulting confusion matrix can be seen in Table 4. Classification of 
geolocations and categories is "okay" (still with a lot of room for improvement), while 
classification of events unfortunately does not work at all. 
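Precision, recall, and F1 per class follow from a confusion matrix in the usual way; for instance, P = 0.61 and R = 0.58 give F1 = 2PR / (P + R) ≈ 0.59, in line with Table 4. The sketch below uses a made-up two-class matrix, since the counts of the actual confusion matrix are not reproduced here; the function name and data layout are assumptions.

```python
def per_class_scores(confusion, classes):
    """confusion[gold][predicted] -> {class: (precision, recall, f1)}."""
    scores = {}
    for c in classes:
        tp = confusion[c][c]
        predicted = sum(confusion[g][c] for g in classes)  # column sum
        gold = sum(confusion[c][p] for p in classes)       # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / gold if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1)
    return scores


# Hypothetical counts, not the values behind Table 4.
classes = ["geolocation", "event"]
confusion = {
    "geolocation": {"geolocation": 8, "event": 2},
    "event": {"geolocation": 4, "event": 1},
}
scores = per_class_scores(confusion, classes)
```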



5 Web application 

To browse hashtags, their co-occurrence dictionaries, and classifications, I created 
the Web application TwitterExplorer. It is built using the Django Web framework. 
The JavaScript libraries Prototype⁸ and RGraph⁹ are used for interface design. 
To reduce page sizes, classification details for individual tweets are loaded using 
AJAX requests. As of this writing, the application is publicly available at 
http://twex.poeschko.com. 

Precision P    0.52   0.13   0.61   0.35   0.41
Recall R       0.60   0.06   0.58   0.39   0.35
F1             0.55   0.08   0.59   0.37   0.38

Table 4: Confusion matrix of the hashtag classifier according to 5-fold cross-validation. 



6 Future work 

There are a number of ways in which this work could be improved or extended: 

• So far, only the precision of the co-occurrence dictionary is evaluated, i.e. the 
question, "how similar are the retrieved hashtags to the given hashtag." It 
would also be interesting to measure its recall, i.e. "how many of the similar 
hashtags are actually retrieved." 

• The clustered graph presented in section 3.3 is only available as a separate 
image file so far. It would be nice to include it in the Web interface, 
maybe complemented by other graph illustrations of hashtags and their 
"surroundings." 

• The hashtag classifier certainly needs further improvement. In general, more 
training examples and a closer investigation of the feature selection might 
help. In order to enable event detection, features like the time distribution 
of tweets (e.g. its entropy) are certainly needed. 

• The dataset also includes "social" information in the form of follower/following 
relationships. Maybe this can be employed in the classifier as well as in 
the ways hashtags can be browsed in TwitterExplorer. Moreover, the "social 
relevance" of Twitter hashtags will be subject to further research. 

• Detecting hypernymy/hyponymy relations among hashtags would be a nice 
additional feature. It might be done statistically or by machine learning. 
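The entropy feature suggested for event detection could be computed as sketched below. This is only an illustration of the idea, assuming tweets are bucketed by day; the function name and the toy data are hypothetical, and no such feature was part of the classifier evaluated here.

```python
import math
from collections import Counter


def time_entropy(days):
    """Shannon entropy (in bits) of the distribution of tweets over days."""
    counts = Counter(days)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A hashtag used on a single day versus one spread evenly over four days.
burst = time_entropy(["2009-09-27"] * 10)
steady = time_entropy(["d1", "d2", "d3", "d4"] * 5)
```

A hashtag tied to a short event concentrates its tweets on few days and thus has low entropy, whereas a general topic spreads over time and has high entropy.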



⁸ http://www.prototypejs.org/
⁹ http://www.rgraph.net/

Figure 3: Start screen of the TwitterExplorer Web application, showing a list of the most 
common hashtags. 



Figure 4: Page for an individual hashtag (#greece) in TwitterExplorer, showing the 
co-occurrence dictionary, overall classification of the hashtag, and the corresponding 
100 tweets used for classification, including the respective classification. Detailed 
information about the classification can be displayed by clicking on the small pie 
charts to the right of each tweet. 